boundary, lower bounds are obtained on the $AT^2$ measure (which subsume bisection
bounds as a special case). When information exchange is due to the storage
the $AT$ measure.
1. Introduction


Considerable  attention  has  been  devoted  in  recent  years  to  the  establishment 
of  lower  bounds  to  the  principal  measures  of  performance  of  VLSI  circuits,  that 
is,  chip  area  A  and  computation  time  T.  Typically,  these  lower  bounds  are  in 
the  form  of  area-time  tradeoffs  and  are  based  on  minimum  requirements  on  the 
amount  of  information  that  must  cross  suitably  chosen  sections  of  the  circuit  chip. 

Lower  bounds  to  performance  measures  are  valid  within  a  well-defined 
computation  model.  In  keeping  with  common  practice,  in  this  paper  we  adopt  the 
so-called  synchronous  VLSI  model. 

A computational problem $\Pi$ is a boolean mapping from a set $\mathcal{I}$ to a set $\mathcal{O}$ of
input and output variables, respectively. The mapping embodied by $\Pi$ is realized
by a boolean machine described as a computation graph, $G = (V,E)$, whose nodes
$V$ are information processing devices or input/output ports and whose arcs $E$ are
wires.

A  VLSI  chip  is  a  two-dimensional  embedding  of  this  computation  graph, 
according  to  the  prescriptions  of  the  model.  The  model  is  characterized  by  a 
collection  of  rules  concerning  layout,  timing,  and  input/output  (I/O)  protocol; 
in  addition,  the  model  restricts  the  class  of  computation  graphs  to  those  having 
bounded  fan-in  and  fan-out  (without  loss  of  generality,  this  bound  is  assumed  to 
be  two) . 

The  layout  rules  are: 

1. Wires (arcs) have minimum width $\lambda$ and at most $\nu$ wires ($\nu \ge 2$) can
overlap at any point;
2. Nodes have minimum area $c\lambda^2$, for some $c \ge 1$.

No loss of generality is incurred if the layout is restricted to be an embedding
of the computation graph in a uniform grid, typically the square grid: the latter
is the plane grid the vertices of which have integer coordinates (layout grid).

The  layout  rules  may  contain  the  additional  specification  that  all  I/O  ports  be 
placed  on  the  boundary  of  the  layout.  Chips  obeying  this  constraint  are 
referred  to  as  boundary  chips;  unless  otherwise  noted,  we  shall  consider 
unrestricted  chips,  where  I/O  ports  can  occur  anywhere  in  the  layouts. 

The timing rules specify that a bit requires a fixed time (hereafter
assumed equal to 1) to propagate along a wire, irrespective of its length
(synchronous system). Finally, the I/O protocol is semellective (each input is
received exactly once), unilocal (each input is received at exactly one input
port), time- and place-determinate (each I/O variable is available at a prespecified
port and time, for all instances of the problem).

For a given problem $\Pi$, a tradeoff between the chip area $A$ and the computation
time $T$ is conveniently expressed by the family of functions

$$A = a_n(T), \qquad T \in [T_{\min}(n), T_{\max}(n)],$$

where $n$ is the input size, $a_n(T)$ is the area of the smallest design that solves
$\Pi$ in time $T$, $T_{\min}$ is the minimum time required to solve $\Pi$ (regardless of the area),
and $T_{\max}$ is a time such that, for $T > T_{\max}$, $a_n(T)$ is constant with respect to $T$.

Equivalently, a tradeoff is expressed as a relation $g(A,T) = \Omega(f(n))$, for a
suitable function $g$.

To  date,  almost  all  known  area-time  lower  bounds  belong  to  one  of  the  three 
following  classes: 

(1) Input-output flow bounds. They are of the form $AT = \Omega(|\mathcal{I}| + |\mathcal{O}|)$,
and follow directly from the fact that the maximum number of bits that can be
exchanged with the exterior in a time unit is proportional to the number of
I/O ports, which in turn is at most proportional to the chip area.

(2) Internal-flow bounds. They are typically of the form $AT^2 = \Omega(I^2)$,
where $I$ is the bisection-information of $\Pi$, i.e. the minimum amount of
information that must be exchanged across any section that equipartitions the
set $V$ [1]. When $I$ is known, the bound follows from the fact that the area is
at least proportional to the square of the minimum bisection width $b$, and that
the number of bits that can be exchanged across this bisection in a time unit is
at most proportional to $b$.

(3) Node-degree bounds. To our knowledge, this type of bound has been
developed only in connection with integer addition [2,3] and is stated as a
lower bound on $AT/\log A$. Such a bound hinges on the fact that, since the
computation time of output functions depends on the number of their arguments,
information must reside within the chip for a certain duration.
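
As a hedged illustration of how an internal-flow bound of class (2) arises, the two facts quoted there can be combined in a few lines. The constants $c_1$ and $c_2$ below are placeholders introduced only for this sketch; they are not values taken from the paper.

```latex
% Sketch only: c_1, c_2 > 0 are unspecified, model-dependent constants.
\begin{align*}
A &\ge c_1\, b^2
  && \text{(area is at least proportional to the square of the bisection width } b\text{)}\\
T &\ge \frac{I}{c_2\, b}
  && \text{(at most } c_2 b \text{ bits cross the bisection per time unit)}\\
AT^2 &\ge c_1 b^2 \cdot \frac{I^2}{c_2^2\, b^2}
   = \frac{c_1}{c_2^2}\, I^2 = \Omega(I^2).
\end{align*}
```

The width $b$ cancels, which is why the bound depends only on the information $I$ that the problem forces across the cut.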

In  this  paper,  with  the  intent  to  take  a  further  step  toward  the  development 
of  a  coherent  theory  of  VLSI  complexity,  we  develop  a  finer  analysis  of  internal 
flow  by  considering  subdivisions  of  the  chip,  which  are  more  demanding  on  the 
information  flow  than  balanced  bisection. 

In  Section  2,  we  introduce  a  novel  technique,  called  "square  tessellation", 
which  is  based  on  a  partition  of  the  chip  into  square  cells  of  equal  size, 
and  on  the  information  exchanged  across  the  boundary  of  these  cells. 

When the mechanism forcing the information exchange is the functional
dependence between variables that are input inside and output outside a tessellation
cell (or vice versa), the tessellation technique leads to bounds on the $AT^2$
measure (of which bisection bounds are a special case).

We also identify a new mechanism of exchange, due to storage saturation
of tessellation cells, which leads to bounds on the $AT$ measure. Indeed,
storage  limitations  may  affect  the  information  exchange  in  several  ways.  For 
example,  during  the  computation,  a  cell  may  fill  its  storage  (a  situation 
referred  to  as  "saturation")  and  hence  be  forced  to  send  some  information 
outside,  just  for  temporary  storage,  and  to  receive  it  back  at  a  later  time. 

In  other  situations,  some  information  input  outside  of  a  cell  is  needed  by  the 
cell  at  several  different  times.  If  the  amount  of  this  information  exceeds  the 
storage  capacity  of  the  cell,  the  information  must  be  transmitted  through  the 
boundary  each  time. 

It must be pointed out that, although both the input/output flow and the
internal flow due to saturation lead to $AT$ bounds, the phenomena involved are
completely different. Indeed, input-output flow occurs through a two-dimensional
section and has the physical dimensions of bits/(length)$^2$, while internal flow
occurs through a one-dimensional section and is expressed in bits/length.

When  considering  either  I/O  or  internal  flow,  information  is  essentially 
viewed  as  a  fluid  whose  flow  is  uniquely  constrained  by  the  available  capacity 
(bandwidth).  However,  in  some  computations,  the  I/O  flow  is  kept  below  capacity 
by  the  fact  that  information  has  to  be  transformed  before  being  output,  and  that 
this  transformation  cannot  be  instantaneous  due  to  the  bounded  fan-in  and 
fan-out of the gates. In the context of the fluid-dynamic analogy of VLSI
computation, it seems appropriate to call this mechanism "computational friction".
A general framework for this phenomenon is also developed in Section 2.


In Section 3 we illustrate the techniques introduced in Section 2 by
deriving new $AT^2$, $AT$ and $AT/\log A$ bounds for the problem of sorting $n$ $k$-bit
keys. Each bound is interesting since it dominates the other two in a suitable
range of key lengths and computation times. In Section 4, we briefly mention
analogous results in relation to the problems of cyclic shift, merging, and
record sorting.

After completion of the work reported in this paper, we have learned that
some $AT^2$ bounds based on the notion of tessellation had been previously derived
in unpublished work by Angluin [4], and that Siegel [12] reports the same $AT^2$
tessellation bound as our Theorem 7.

We  also  mention  that  partitions  of  the  layout  into  multiple  regions 
(different  from  square  tessellation)  have  been  used  in  [5]  to  study  the 
area-time  complexity  of  multilective  protocols. 

2. Square Tessellations and Lower Bounds on Performance Measures

The general background of all considerations developed in this section is
a partition of the laid-out computation graph into two portions, which, when
appropriate, will be identified with two processors $P_1$ and $P_2$ cooperating to
solve the given problem $\Pi$.

The partition may refer either to topological properties of the graph
(sets of vertices and edges), or to its computational properties (set of I/O
variables). The two cases are referred to as "dichotomy" and "I/O assignment"
respectively, as expressed by the following definitions:

Definition 1. If $\mathcal{V} = \mathcal{I} \cup \mathcal{O}$ is the set of I/O variables of $\Pi$, an I/O
assignment for $\Pi$ is a partition $(\mathcal{V}_1,\mathcal{V}_2)$ of $\mathcal{V}$.

Definition 2. Given a graph $G = (V,E)$, a dichotomy of $G$ is a partition
$(V_1,V_2)$ of $V$; $\delta(V_1,V_2)$, called dichotomy width, is the number of edges connecting
$V_1$ to $V_2$.

The square tessellation, to be introduced next, is a device that we use to
generate partitions. Let the auxiliary grid be the translation of the layout
grid by the vector $(\tfrac12, \tfrac12)$.

Definition 3. A square tessellation with sidelength $\ell$ is a subgraph of the
auxiliary grid formed by two sheaves $\{x = \tfrac12 + j\ell : j \ge 0\}$ and $\{y = \tfrac12 + j\ell : j \ge 0\}$
of evenly spaced rows and columns with spacing $\ell$.

A square tessellation is a partition of the plane into identical $\ell \times \ell$
tiles called cells, i.e. it is a geometric partition. It can be used to produce
a dichotomy (and the associated I/O assignment) by identifying an individual
cell with one term of the dichotomy, and the rest of the layout with the other
term. The outstanding feature of the tessellation technique is that it
permits accounting for the simultaneous presence of all other cells.
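
The way a tessellation induces a dichotomy can be made concrete with a small sketch. The code below is illustrative only: it assumes that vertices sit at integer grid points with coordinates at least 1 and that the cut lines lie at half-integer offsets `1/2 + j*side`; the function names are hypothetical and not part of the paper.

```python
from collections import Counter


def cell_of(x, y, side):
    """Index (column, row) of the side-by-side cell containing grid point (x, y).

    Assumption of this sketch: cut lines sit at 1/2 + j*side, so the integer
    points side*j + 1 .. side*(j + 1) fall into column (or row) j.
    """
    return ((x - 1) // side, (y - 1) // side)


def cell_loads(points, side):
    """Count how many of the given grid points fall into each cell."""
    return Counter(cell_of(x, y, side) for x, y in points)


def split_by_cell(points, side, cell):
    """Dichotomy induced by one cell: (points inside it, points outside it)."""
    inside = [p for p in points if cell_of(p[0], p[1], side) == cell]
    outside = [p for p in points if cell_of(p[0], p[1], side) != cell]
    return inside, outside
```

Because every cell is examined against the same layout, the loads of all cells are available at once, which is the property the text singles out.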




We begin by obtaining a lower bound to the area $A$ of the layout in terms
of (topological) properties of the graph.

2.1.  Lower  bounds  on  the  area  of  the  chip 

Given a graph $G = (V,E)$, and an integer $0 < m < |V|$, consider the set
$\Gamma_m = \{(V_1,V_2) : |V_1| = m\}$ and let

$$\delta(m) = \min_{(V_1,V_2) \in \Gamma_m} \delta(V_1,V_2), \tag{1}$$

i.e., $\delta(m)$ is the smallest dichotomy width over all dichotomies of $V$ with
fixed ratio $|V_1|/|V_2|$. Thus, $\delta(\lfloor |V|/2 \rfloor)$ is Thompson's minimum bisection
width [1]. We can now state the following theorem:
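
For very small graphs, the quantity in (1) can be computed by exhaustive search, which may help in checking the definition. The following is a brute-force sketch (exponential in the number of vertices); the names `dichotomy_width` and `delta` are hypothetical helpers chosen for this illustration.

```python
from itertools import combinations


def dichotomy_width(edges, part):
    """Number of edges with exactly one endpoint in `part`."""
    part = set(part)
    return sum((u in part) != (v in part) for u, v in edges)


def delta(vertices, edges, m):
    """Smallest dichotomy width over all dichotomies whose first term has m vertices."""
    return min(dichotomy_width(edges, subset)
               for subset in combinations(vertices, m))
```

Calling `delta(vertices, edges, len(vertices) // 2)` gives the minimum bisection width in the sense used above.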

Theorem 1. For every graph $G = (V,E)$ and every $m < |V|$


Proof :  We  position  the  layout  rectangle  R  with  its  southwest  corner  at 
and  partition  it  by  means  of  a  square  tessellation,  with  spacing 
$\ell = \lfloor (\delta(m) - 1)/4 \rfloor$. Both sides of $R$ have length at least $\delta(m)-1$. In fact, we
can cut $R$ by means of a vertical two-bend polygonal line on the auxiliary grid
(zig-zag line), into two polygons one of which contains exactly $m$ vertices.
Thus, at least $\delta(m)$ edges cross the cut, $\delta(m)-1$ of which are horizontal, and
the vertical side has length at least $\delta(m)-1$. A similar argument applies to
the horizontal side. Then $R$ contains at least 16 cells of the tessellation
and the smallest tessellation rectangle $R'$ containing $R$ has area at most
25/16 times as large as the area of $R$. We now claim that each cell of $R'$
contains fewer than $m$ vertices of $G$. Indeed, if a cell contains $m$ or more
vertices we cut it, by means of a zig-zag line, into two polygons one of
which contains exactly $m$ vertices; this polygon has perimeter
$p \le 4\ell = 4\lfloor (\delta(m) - 1)/4 \rfloor \le \delta(m) - 1$, so that fewer than $\delta(m)$ edges cross the
cell boundary, contrary to the definition of $\delta(m)$. It follows that at least
$\lceil |V|/m \rceil$ cells of $R'$ contain some node of $G$, and therefore overlap with $R$.
The area of $R'$ is at least as large as the global area $\lceil |V|/m \rceil \lfloor (\delta(m)-1)/4 \rfloor^2$
of these cells, so that $A$ is at least 16/25 of it. □
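
The last step of the proof can be evaluated numerically. The sketch below assumes the bound takes the form read off from that step, namely 16/25 of the total area of $\lceil |V|/m \rceil$ cells of side $\lfloor (\delta(m)-1)/4 \rfloor$; the function name and argument names are hypothetical.

```python
def area_lower_bound(num_vertices, m, delta_m):
    """Area bound read off the final step of the proof (a sketch, not the
    paper's stated formula): (16/25) * ceil(|V|/m) * floor((delta(m)-1)/4)**2.
    """
    side = (delta_m - 1) // 4        # tessellation sidelength
    cells = -(-num_vertices // m)    # ceil(|V| / m) in integer arithmetic
    return 16 * cells * side * side / 25
```

For example, with 1024 vertices, `m = 512` and `delta_m = 33`, the sidelength is 8 and two cells are counted.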

The best lower bound to the area of a layout of $G$ is obtained for the
value $m_0$ of $m$ that maximizes the ratio $\delta^2(m)/m$. For most of the computation
graphs considered in the literature $m_0 = |V|/2$, yielding $A = \Omega(\delta^2(m_0))$. This
accounts for the success of bisection techniques. We shall see later that for
the computation graph of some significant problems - such as sorting - $\delta^2(m)/m$
achieves its maximum for values of $m$ considerably smaller than $|V|$.

Since the graphs to be considered in this paper are computation graphs,
it is convenient to establish a link between the notions of dichotomy and of
I/O assignment. Specifically, in a computation graph $G$ a subset $U$ of $V$ is the
set of I/O ports, and $v \in U$ handles $\varphi(v)$ variables during the computation; in other
words, an integer-valued function $\varphi$ on $V$ adequately identifies the I/O ports,
each with its multiplicity. The fact that, for some $v \in U$, we have $\varphi(v) > 1$,
introduces a "coarser granularity" in the partition of the I/O variables, which
must be dealt with.


To achieve a possibly useful broader generality, for $G = (V,E)$ we consider
arbitrary weighting functions $\varphi : V \to \mathbb{N}$ (the nonnegative integers), and denote
by $\varphi_{\max} = \max\{\varphi(v) : v \in V\}$ and by $M = \sum_{v \in V} \varphi(v)$ the global weight. For an
integer $m < M$, define the following class of dichotomies:

$$\Gamma = \Big\{(V_1,V_2) : m - \varphi_{\max} + 1 \le \sum_{v \in V_1} \varphi(v) \le m\Big\}. \tag{3}$$

Intuitively, a weighting function models a general distribution of input/
output variables, and the generalized definition (3) of class of dichotomies
copes with the input granularity, as expressed by $\varphi_{\max}$. By a straightforward
modification of the proof of Theorem 1, we can then establish:

Theorem 2. For a graph $G = (V,E)$, let $\varphi$ be a weighting function on $V$
with parameters $M$ and $\varphi_{\max}$, and let $\Gamma$ satisfy (3). Define

$$\delta_\Gamma = \min_{(V_1,V_2) \in \Gamma} \delta(V_1,V_2).$$


Then 


(4) 


Note that when $\varphi(v) = \varphi_{\max} = 1$ for every $v \in V$, $M = |V|$, $\Gamma$ becomes $\Gamma_m$,
and (4) subsumes (2). It must be stressed that the notion of dichotomy
pertains to a given graph; we shall now see how relation (4) will be instrumental
to obtain area-time lower bounds on all graphs solving a given $\Pi$.

2.2. Area-time lower bounds based on information exchange ($AT^2$-theory)

In this section we shall consider a graph $G = (V,E)$ as a computation graph
solving problem $\Pi$ in time $T$, and we shall establish a lower bound to its layout
area as a function of $T$ and of parameters of $\Pi$.

The starting point of the argument is a dichotomy $(V_1,V_2)$. Such a
dichotomy naturally identifies two processors $P_1$ and $P_2$, where $P_i$ is defined as
the subgraph induced by vertex set $V_i$ ($i = 1,2$). In turn, if we define $\mathcal{V}_i$
as the I/O variables of $\Pi$ handled by processor $P_i$ ($i = 1,2$), then we obtain an
I/O assignment $(\mathcal{V}_1,\mathcal{V}_2)$. Such an assignment is characterized by a very important
parameter, called "information exchange", as expressed by the following
definition (similar definitions have been considered by many authors, e.g.
[1,6,7,8]):


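Definition 4. The information exchange of $\Pi$ under assignment $(\mathcal{V}_1,\mathcal{V}_2)$
is defined as

$I(\mathcal{V}_1,\mathcal{V}_2)$ = the minimum over all the algorithms (which solve $\Pi$ under
assignment $(\mathcal{V}_1,\mathcal{V}_2)$) of the maximum over all the problem instances of the number
of bits exchanged between $P_1$ and $P_2$.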

In other words, for any algorithm that solves $\Pi$ under $(\mathcal{V}_1,\mathcal{V}_2)$ there is at
least a problem instance for which $P_1$ and $P_2$ exchange $I(\mathcal{V}_1,\mathcal{V}_2)$ or more bits, and
no integer larger than $I(\mathcal{V}_1,\mathcal{V}_2)$ enjoys the same property.

In the following, $(\mathcal{V}_1,\mathcal{V}_2)$ can be an arbitrary member of a class $H$ of
assignments, so that we need to lower-bound the information exchange for all
members of this class. The argument is completed by considering a class of
dichotomies of an arbitrary, but fixed, graph $G$ solving $\Pi$ in time $T$, and the
corresponding class $H$ of I/O assignments, and by bounding the dichotomy width
of $G$ in terms of the information exchange of $H$ and the time $T$.

More formally, for a class $H$ of I/O assignments for $\Pi$, we define

$$I_H = \min_{(\mathcal{V}_1,\mathcal{V}_2) \in H} I(\mathcal{V}_1,\mathcal{V}_2). \tag{5}$$

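We now relate $\delta_\Gamma$ of a given class $\Gamma$ of dichotomies of $G$ with $I_H$ of the
corresponding class $H$ of assignments:

Lemma 1. Let $\Gamma$ be a class of dichotomies of a graph $G$, which solves $\Pi$
in time $T$. Let $H = \{(\mathcal{V}_1,\mathcal{V}_2) : (\mathcal{V}_1,\mathcal{V}_2)$ corresponds to $(V_1,V_2) \in \Gamma\}$. Then:

$$\delta_\Gamma \ge I_H / T.$$

Proof. Let $(V_1,V_2) \in \Gamma$ be a dichotomy of $V$ and $(\mathcal{V}_1,\mathcal{V}_2)$ the corresponding
assignment. Then $P_1$ must be able to exchange $I(\mathcal{V}_1,\mathcal{V}_2) \ge I_H$ bits with $P_2$ in
time $T$, and therefore $V_1$ must be connected to $V_2$ by at least $I_H/T$ edges. □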




Let $\mathcal{U} \subseteq \mathcal{V}$ be a set of I/O variables of $\Pi$ and $\varphi(v)$ the number of variables
of $\mathcal{U}$ handled by node $v$. As the class $\Gamma$ we consider the one defined by relation
(3), for $m \le |\mathcal{U}|$ and $\varphi_{\max} \le T$, and obtain the following result:


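Theorem 3. Let $\Gamma$ be a class of dichotomies of $G$ (which solves $\Pi$ in time $T$)
satisfying (3) for $m \le |\mathcal{U}|$ and $\varphi_{\max} \le T$. Letting $H = \{(\mathcal{V}_1,\mathcal{V}_2) : (\mathcal{V}_1,\mathcal{V}_2)$ corresponds
to $(V_1,V_2) \in \Gamma\}$, we have

$$AT^2 = \Omega\!\left(\frac{|\mathcal{U}|}{m}\, I_H^2\right). \tag{7}$$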


Proof :  Immediate,  by  combining  Theorem  2  and  Lemma  1.  □ 

Theorem 3 is the cornerstone of the so-called $AT^2$-theory of VLSI computation
and has far-reaching consequences. The reasons are that for most computational
problems we are able:

(i) to characterize the class $H$ corresponding to $\Gamma$ that satisfies (3);

(ii) to compute or bound $I_H$.

To make the tradeoff more explicit, we note that, for given parameters
$\varphi_{\max}$ and $m$ of $\Gamma$, $H$ is the class of I/O assignments for which

$$m - \varphi_{\max} + 1 \le |\mathcal{V}_1 \cap \mathcal{U}| \le m.$$

Moreover, if $H_j$ is the subclass of $H$ for which $|\mathcal{V}_1 \cap \mathcal{U}| = j$, then
$\{H_{m-\varphi_{\max}+1}, \ldots, H_m\}$ is a partition of $H$. Therefore, denoting $I(j) = I_{H_j}$,
by the definition (5) of $I_H$, we have

$$I_H = \min\{I(m-\varphi_{\max}+1), \ldots, I(m)\}.$$

Therefore we can work with $I(j)$ to obtain a lower bound to $I_H$. Notice at first
that $|I(j) - I(j-1)| \le 1$, since by just sending one bit from $P_1$ to $P_2$ we can
transform an assignment in $H_j$ to one in $H_{j-1}$ (or vice versa). Thus, by using
$I(0) = 0$ and the above inequality we obtain

$$I(m) \le m, \tag{9}$$

$$I(m) - I_H \le \varphi_{\max} - 1. \tag{10}$$

This inequality is used crucially in the following theorem.

Theorem 4. Let $G$ be a computation graph solving $\Pi$ in time $T$. Then


Proof. Let $\varphi_{\max} = \max_{v \in V} \varphi(v)$. Let $\Gamma$ be the class of dichotomies of $G$
satisfying (3) and $H$ its corresponding class of I/O assignments. Clearly,
$\varphi_{\max} \le T$, since each port reads or writes at most one bit per unit of time.
Relation (10) becomes $I(m) - I_H \le T - 1$, or equivalently $I_H \ge I(m) - T + 1$.
Therefore relation (7) can be rewritten as

$$AT^2 \ge \lambda_1 \frac{|\mathcal{U}|}{m} \big(I(m) - T + 1\big)^2 \tag{11}$$

for some constant $\lambda_1$. This bound may become weak for large values of $T$;
however, for large $T$, we expect $AT$ to remain large. Indeed, we have the interplay
of two contrasting phenomena, an I/O bound (prevailing for large $T$) and an
information-exchange bound (prevailing for small $T$) (*). Specifically, the I/O
bound $AT \ge |\mathcal{U}|$ yields

$$AT^2 \ge |\mathcal{U}|\, T. \tag{12}$$

Combining relations (11) and (12) we obtain

$$AT^2 \ge |\mathcal{U}| \max\Big\{\frac{\lambda_1}{m}\big(I(m) - T + 1\big)^2,\; T\Big\}.$$

(*) Lower-bound arguments of analogous flavor are due to Savage [9] and
Brent-Kung [10].

If we choose for $T$ the value $T_0 = I^2(m)/(2I(m) + (m/\lambda_1))$, then we have

1. $T \le T_0$. Trivially, $I(m) - T + 1 > I(m) - T$. Since $(I(m)-T)^2$ is a
decreasing function of $T$, it achieves its minimum at $T_0$, whence:

$$\big(I(m)-T+1\big)^2 > \Big(I(m) - \frac{I^2(m)}{2I(m)+(m/\lambda_1)}\Big)^2 > I^2(m)\Big[\frac{I(m)+(m/\lambda_1)}{2I(m)+2(m/\lambda_1)}\Big]^2 = \frac{I^2(m)}{4}.$$

2. $T \ge T_0$. By (9), $I(m) \le m$, whence $T \ge T_0 \ge I^2(m)/(2m + m/\lambda_1) \ge I^2(m)/4m$,
taking $\lambda_1 \ge 1/2$.

This completes the proof of the theorem. □
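
The two-case argument can be spot-checked numerically. The sketch below assumes $1 \le I(m) \le m$ and $\lambda_1 \ge 1/2$, as in the proof; `case_split_holds` is a hypothetical helper that returns whether the inequality claimed for the relevant case holds at a given $T$.

```python
def case_split_holds(i_m, m, lam, t):
    """Check the inequality used in the case that applies to time t.

    Assumptions of this sketch: 1 <= i_m <= m and lam >= 1/2.
    Case 1 (t <= T0): (I(m) - t + 1)^2 >= I(m)^2 / 4.
    Case 2 (t >  T0): t >= I(m)^2 / (4 m).
    """
    a = m / lam
    t0 = i_m ** 2 / (2 * i_m + a)   # threshold T0 from the proof
    if t <= t0:
        return (i_m - t + 1) ** 2 >= i_m ** 2 / 4
    return t >= i_m ** 2 / (4 * m)
```

In either case the maximum of the two terms in the combined bound is at least a constant multiple of $I^2(m)/m$, which is what the case analysis is designed to show.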
2.3.  Saturation  area-time  lower  bounds  (AT-theory) 


When we ideally isolate a region of the layout of a VLSI system, not only
is the bandwidth between this region and the remaining part of the layout bounded
by the perimeter of the region, but also the amount of information that can be
stored within the region is bounded by its area. This fact has important
consequences for the area-time performance of some computations. In this section
we develop techniques to express these effects in a quantitative manner.

In the familiar framework where processors $P_1$ and $P_2$ cooperate to solve
problem $\Pi$, we now assume that only a limited amount $s_i$ of storage is available
in $P_i$ ($i = 1,2$), and refine Definition 4 as follows:

Definition 5. The information exchange of $\Pi$ under assignment $(\mathcal{V}_1,\mathcal{V}_2)$ and the
condition that $P_i$ can store at most $s_i$ bits ($i = 1,2$) is defined as:

$I(\mathcal{V}_1,\mathcal{V}_2 \mid s_1,s_2)$ = the minimum over all algorithms (which solve $\Pi$ under assignment
$(\mathcal{V}_1,\mathcal{V}_2)$ and storage bounds $s_1$ and $s_2$) of the maximum over all
problem instances of the number of bits exchanged between $P_1$ and $P_2$.

1 
In analogy with previous definitions, for a class $H$ of assignments for $\Pi$
we define

$$I_H(s_1,s_2) = \min_{(\mathcal{V}_1,\mathcal{V}_2) \in H} I(\mathcal{V}_1,\mathcal{V}_2 \mid s_1,s_2) \tag{13}$$

and let $I(m \mid s_1,s_2) = I_{H_m}(s_1,s_2)$. Note that $I(\mathcal{V}_1,\mathcal{V}_2 \mid s_1,s_2)$ is a nonincreasing
function of $s_1$ and $s_2$ (more storage never hurts), whence

$$I(\mathcal{V}_1,\mathcal{V}_2) = I(\mathcal{V}_1,\mathcal{V}_2 \mid \infty,\infty) \le I(\mathcal{V}_1,\mathcal{V}_2 \mid s_1,s_2).$$

We now identify with processor $P_1$ a cell of a square tessellation of the
layout with sidelength $\ell$, and let $m$ be the number of variables handled by $P_1$.
Clearly the storage of $P_1$ is upper-bounded by $\ell^2$, and we have
$I(m \mid s_1,s_2) \ge I(m \mid \ell^2, A-\ell^2) \ge I(m \mid \ell^2, +\infty)$, where $A$ is the layout area. Since the
cell's perimeter is $4\ell$, we conclude

$$T \ge \frac{I(m \mid \ell^2, \infty)}{4\ell}. \tag{14}$$

If $|\mathcal{V}|$ variables are input/output in area $A$, then for any $\mathcal{U} \subseteq \mathcal{V}$ there is at
least one cell of the tessellation that handles at least $|\mathcal{U}|\ell^2/A$ variables of $\mathcal{U}$.
However, due to the granularity of the input/output there may be no cell
handling exactly $|\mathcal{U}|\ell^2/A$ variables; on the other hand, if cell $C$ handles more
than $|\mathcal{U}|\ell^2/A$ variables, since $\varphi_{\max} \le T$, a zig-zag cut will isolate a portion
of $C$ handling $(|\mathcal{U}|\ell^2/A + h)$ variables for some $h$ satisfying $0 \le h < T$. Thus
we can write

$$T \ge \frac{I\big(|\mathcal{U}|\ell^2/A + h \mid \ell^2, \infty\big)}{4\ell} \qquad \text{for some } h \in [0, T-1].$$

This discussion proves the following theorem.


Theorem 5. Let $G$ be a computation graph for problem $\Pi$. Let $\mathcal{U}$ be a set of
I/O (binary) variables of $\Pi$. If $H_m$ is the class of assignments such that
exactly $m$ variables of $\mathcal{U}$ are assigned to $P_1$, and $I(m \mid s,\infty)$ is the information
exchange of $H_m$ when $P_1$ has $s$ bits of storage, then the area-time performance of
any layout of $G$ satisfies the bound, for arbitrary $\ell$,

$$T \ge \min_{0 \le h < T} \frac{I\big(|\mathcal{U}|\ell^2/A + h \mid \ell^2, \infty\big)}{4\ell}. \tag{15}$$


Remark 1. To obtain the best bound we choose the value of $\ell$ that maximizes the
right-hand side of (15).

Remark 2. If $I(m \mid s,\infty)$ is increasing with $m$, then $T \ge I(|\mathcal{U}|\ell^2/A \mid \ell^2,\infty)/4\ell$.

Remark 3. If for the value $\ell_0$ of $\ell$ that maximizes (15) we have $I(m \mid \ell_0^2,\infty) = \beta m$
for some constant $\beta$ (as is found in all applications considered so far)
then we can rewrite bound (15) as

$$T \ge \frac{\beta\,|\mathcal{U}|\,\ell_0^2}{A \cdot 4\ell_0},$$

or equivalently

$$AT = \Omega(|\mathcal{U}|\,\ell_0). \tag{16}$$


Usually $\ell_0$ is an increasing function of $|\mathcal{U}|$ so that (16) is a stronger bound
than the straightforward I/O bound $AT = \Omega(|\mathcal{U}|)$.




In all of the previous considerations we have viewed the computational
process inside a VLSI chip as a fluid-dynamic phenomenon, so that the
performance bounds are determined by the ability to ensure certain flows within
the prescribed time. Note that this viewpoint is completely oblivious of the
fact that the vertices of $G$ have bounded fan-in and fan-out; however, computation
delays are exactly due to these limitations, which we now wish to bring into
the picture.

We  begin  by  considering  the  notion  of  "functional  dependence". 

Definition 6. Given a function $y = f(x)$, where $x = (x_1,\ldots,x_p)$ and
$y = (y_1,\ldots,y_q)$ are boolean vectors, we say that $y_j$ is functionally dependent
on $x_i$ if there exist two boolean vectors $x'$ and $x''$ that differ only in the
$i$-th component, such that $y' = f(x')$ and $y'' = f(x'')$ differ in the $j$-th component.

We now explicitly consider that all gates of our circuits have fan-in
bounded by a constant $f_I$; if output variable $y$ is functionally dependent on $s$
input variables, then at least $\log_{f_I} s$ time units must elapse between the instant
when the first of the input variables is read, and the instant when $y$ is output.

Hereafter we assume $f_I = 2$.
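
For small boolean functions, Definition 6 can be tested by exhaustive enumeration, and the fan-in delay can be computed as an integer. The following is a minimal sketch: `depends_on` and `min_delay` are hypothetical helper names, `f` is assumed to map a tuple of `p` bits to a tuple of output bits, and indices are zero-based.

```python
from itertools import product


def depends_on(f, p, j, i):
    """True if output bit j of f is functionally dependent on input bit i.

    Searches all pairs of p-bit inputs that differ only in position i.
    """
    for x in product((0, 1), repeat=p):
        flipped = list(x)
        flipped[i] ^= 1
        if f(x)[j] != f(tuple(flipped))[j]:
            return True
    return False


def min_delay(s, fan_in=2):
    """Smallest integer d with fan_in**d >= s.

    A tree of gates with the given fan-in and depth d reaches at most
    fan_in**d inputs, so an output depending on s inputs needs depth >= d.
    """
    depth, reach = 0, 1
    while reach < s:
        reach *= fan_in
        depth += 1
    return depth
```

With `fan_in=2`, an output that depends on 8 inputs cannot appear sooner than 3 time units after they are read.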


Informally, suppose that $s$ variables $x_1,\ldots,x_s$ are input at the same time,
and that there exist $r$ output variables $y_1,\ldots,y_r$ with the following properties:

(1) each $y_j$ is functionally dependent on all $x_i$'s;

(2) the variables $y_1,\ldots,y_r$ carry $I$ bits of information on $x_1,\ldots,x_s$.

Then since no $y_j$ emerges before $\log s$ units of time, the system must be capable
of storing $I$ bits for $\log s$ units of time, or $AT \ge I \log s$. In other words,
in the analogy where information is a fluid flowing from input ports to output
ports, we can say that functional dependence acts as a kind of computational
friction, that slows down the flow keeping it below capacity. This intuitive
notion can be formalized in the following theorem:

Theorem 6. Given a computational problem $\Pi$ with a set $\mathcal{I}$ of input variables
and a set $\mathcal{O}$ of output variables, let $\mathcal{U}$ be a subset of $\mathcal{I}$ such that for any
partition $\mathcal{U}_1,\ldots,\mathcal{U}_T$ of $\mathcal{U}$ there exists a collection $\mathcal{Y}_1,\ldots,\mathcal{Y}_T$ of disjoint subsets
of $\mathcal{O}$ (not necessarily a partition) satisfying the following properties:

(1) Each variable in $\mathcal{Y}_t$ is functionally dependent upon at least $\alpha(|\mathcal{U}_t|)$
variables of $\mathcal{U}_t$, where $\alpha(\cdot)$ is an increasing function of its argument.

(2) The variables in $\mathcal{I} - \mathcal{U}$ can be selected so that, for each $t = 1,2,\ldots,T$,
the variables of $\mathcal{Y}_t$ carry at least $\beta(|\mathcal{U}_t|)$ bits of information relative
to $\mathcal{U}_t$, where $\beta(\cdot)$ is an increasing function of its argument.

Then, for any VLSI system that solves $\Pi$,

$$AT \ge \min_{s_1 + \cdots + s_T = |\mathcal{U}|} \; \sum_{t=1}^{T} \beta(s_t) \log \alpha(s_t). \tag{17}$$

Proof. Let us define $\mathcal{U}_t$ as the subset of variables of $\mathcal{U}$ that are input by
the system (exactly) at time $t$. Clearly, $\mathcal{U}_1,\ldots,\mathcal{U}_T$ partition $\mathcal{U}$. Let
$\mathcal{Y}_1,\ldots,\mathcal{Y}_T$ be the corresponding collection of disjoint subsets of $\mathcal{O}$. Because
of property (1), and of the bounded-fan-in assumption, no variable in $\mathcal{Y}_t$ can
be output before time $t + \log \alpha(|\mathcal{U}_t|)$. Because of property (2), at least
$\beta(|\mathcal{U}_t|)$ bits of information on $\mathcal{U}_t$ have to be stored by the system in the interval
$[t, t + \log \alpha(|\mathcal{U}_t|)]$. Since one bit stored for $\tau$ units of time contributes $\tau$ to
the $AT$ product, we conclude that

$$AT \ge \sum_{t=1}^{T} \beta(|\mathcal{U}_t|) \log \alpha(|\mathcal{U}_t|), \tag{18}$$

where $\sum_{t=1}^{T} |\mathcal{U}_t| = |\mathcal{U}|$. The partition $\mathcal{U}_1,\ldots,\mathcal{U}_T$, and hence the right-hand side of (18),
is a function of the input schedule and varies from system to system, but it is
never smaller than the right-hand side of (17). □

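Corollary 1. Under the same assumptions of Theorem 6, if $\beta(s)\log\alpha(s)$ is
a downward-convex function of $s$, then the minimum in (17) is achieved when
$s_1 = s_2 = \cdots = s_T = |\mathcal{U}|/T$, and (17) yields

$$A \ge \beta(|\mathcal{U}|/T) \log \alpha(|\mathcal{U}|/T). \tag{19}$$

Corollary 2. If $\beta(s) = \beta_0 s$, and $\alpha(s) = \alpha_0 s$, then (19) can be rewritten as

$$A = \Omega\big((|\mathcal{U}|/T) \log(|\mathcal{U}|/T)\big), \tag{20}$$

or

$$AT/\log A = \Omega(|\mathcal{U}|). \tag{21}$$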
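
The role of convexity in Corollary 1 can be illustrated numerically. The sketch below evaluates the sum in (17) for a given input schedule and compares an even split with a skewed one, taking $\beta(s) = \alpha(s) = s$ as in Corollary 2 with unit constants (an assumption made only for this example); the helper names are hypothetical.

```python
from math import log2


def friction_sum(sizes, beta, alpha):
    """Sum of beta(s_t) * log2(alpha(s_t)) over the non-empty steps of a schedule."""
    return sum(beta(s) * log2(alpha(s)) for s in sizes if s > 0)


def equal_split_area(total, steps, beta, alpha):
    """Per-step term for the even schedule s_t = total / steps, as in (19)."""
    s = total / steps
    return beta(s) * log2(alpha(s))


def identity(s):
    # Unit-constant choice beta(s) = alpha(s) = s, used only for illustration.
    return s
```

For 64 variables read over 4 steps, the even schedule `[16, 16, 16, 16]` gives a smaller sum than the skewed schedule `[61, 1, 1, 1]`, consistent with the minimum being attained at equal step sizes when the summand is convex.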

Theorem 6 and its Corollaries 1 and 2 apply to chips with bounded fan-in.
Similar arguments yield analogous results for chips with bounded fan-out. In
this case, the friction arises from the fact that, if $s$ variables $y_1,\ldots,y_s$ are
output at time $t$, and are all functionally dependent upon input variable $x$,
then $x$ must be input prior to time $t - \log s$.


In  this  section  we  shall  illustrate  the  lower-bound  techniques  introduced 
in  Section  2  by  applying  them  to  the  sorting  problem,  which  we  define  formally 
as  follows . 

Definition 7. In the (n,k)-sorting problem:

(1) The input is a sequence of $n$ $k$-bit keys, each a member of a finite set of integers.

(2) The output is a rearrangement of the input keys, so that they form a nondecreasing sequence.

Notationally, input and output are represented as $n \times k$ binary arrays $X = \{X_i^j\}$ and $Y = \{Y_i^j\}$, respectively, with $i = 0,\ldots,n-1$ and $j = k-1,\ldots,0$; $X_i$ is the $(i+1)$-st input key and $X^j$ is the bit position of weight $2^j$ (or briefly bit position $j$). We can view the input of (n,k)-sorting as an (n,k)-multiset, i.e. a multiset of $n$ elements each drawn from a universe of $2^k$ values. As we will show, the nature of (n,k)-sorting changes considerably with the relative size of $n$ and $k$, and we find it useful to classify (n,k)-multisets into multisets of short keys ($1 \le k \le \log n$), of intermediate-length keys ($\log n < k < 2\log n$), and of long keys ($2\log n \le k$). With a slight abuse of terminology, we shall use phrases like "sorting short keys" instead of "sorting a multiset of short keys."
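
The classification can be sketched in a few lines. This is only an illustration: the function name `key_regime` is hypothetical, logarithms are taken base 2, and the boundary cases follow the conventions $k \le \log n$ (short) and $k \ge 2\log n$ (long) used in the theorems of this section.

```python
import math

def key_regime(n, k):
    """Classify (n, k)-sorting by key length (logarithms base 2)."""
    if k < 1:
        raise ValueError("keys must have at least one bit")
    log_n = math.log2(n)
    if k <= log_n:
        return "short"
    if k < 2 * log_n:
        return "intermediate"
    return "long"
```

For instance, with $n = 1024$ keys, 10-bit keys are short, 15-bit keys are of intermediate length, and 20-bit keys are long.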
To gain intuition on the transfer of information from input to output, we observe that

$$Y_i^j = X_{\pi(i)}^j, \qquad i = 0,1,\ldots,n-1;\; j = 0,1,\ldots,k-1,$$

where $\pi(0),\pi(1),\ldots,\pi(n-1)$ is a permutation of $0,1,\ldots,n-1$. Thus, there is an information flow from the input to the output ports of the same position, which we call primary flow. In the primary flow, each bit enters and leaves the system maintaining its identity. However, the exact destination of each bit within its own position depends on $\pi$, which, for position $j$, is determined by the values of the data in positions $j, j+1, \ldots, k-1$. This information, flowing from most significant to least significant positions, is called secondary flow.

Primary flow considerations have been used by Thompson [1] in proving the $AT^2 = \Omega(n^2\log^2 n)$ bound for word-local $(n, \log n + \Theta(\log n))$-sorting. The bound was later generalized to non-word-local protocols by Leighton [11], who succeeded in combining primary and secondary flows with the help of cyclic-shift arguments.

In Sections 3.1 and 3.2, we combine a quantitative characterization of primary and secondary flows with the general techniques of Section 2, to obtain novel lower bounds for the problems of sorting short and long keys.

3.1  Short  Keys 

In this section we study (n,k)-sorting for $k \le \log n$, and denote by $r = 2^k$ the size of the universe of possible keys. We shall obtain an $AT^2$ and an $AT$ bound. Both bounds are based on primary flow, as expressed by the following lemma.

Lemma 2. Given $r/2$ arbitrary input bits

$$\mathcal{U}_{in} = \{X^0_{p_0}, X^0_{p_1}, \ldots, X^0_{p_{r/2-1}}\} \tag{22}$$

and $r/2$ arbitrary output bits

$$\mathcal{U}_{out} = \{Y^0_{q_0}, Y^0_{q_1}, \ldots, Y^0_{q_{r/2-1}}\} \tag{23}$$

with $q_0 < q_1 < \cdots < q_{r/2-1}$, the remaining input bits can be set to constant values to enforce the condition

$$Y^0_{q_i} = X^0_{p_i}, \qquad i = 0,1,\ldots,r/2-1. \tag{24}$$

Proof. We set $X_{p_i} = 2i + X^0_{p_i}$, $i = 0,1,\ldots,r/2-1$, and divide the remaining $n - r/2$ input keys arbitrarily into $r/2+1$ sets such that, for $i = 0,1,\ldots,r/2-1$, the $i$-th set contains $(q_i - q_{i-1} - 1)$ keys whose value is set to $2i$ (with $q_{-1} = -1$), and the $(r/2)$-th set contains $(n - 1 - q_{r/2-1})$ keys whose value is set to $r-1$. The output sequence corresponding to this input is shown in Figure 1, and it satisfies Equation (24). □
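
The construction in the proof of Lemma 2 can be checked exhaustively on small instances. The sketch below is our own illustration, with hypothetical function names; it assumes 0-indexed positions, the convention $q_{-1} = -1$, and filler keys of value $r-1$ in the last group (so that they sort after every selected key).

```python
from itertools import product

def lemma2_instance(n, k, p, q, bits):
    """Build n k-bit keys so that sorted_keys[q[i]] has least significant
    bit bits[i], where bits[i] is the bit placed at input position p[i]."""
    r = 2 ** k
    half = r // 2
    assert len(p) == len(q) == len(bits) == half
    assert len(set(p)) == half and list(q) == sorted(set(q))
    keys = [None] * n
    for i in range(half):
        keys[p[i]] = 2 * i + bits[i]             # selected key: value 2i + free bit
    filler, prev = [], -1
    for i in range(half):
        filler += [2 * i] * (q[i] - prev - 1)    # i-th group of constant keys
        prev = q[i]
    filler += [r - 1] * (n - 1 - prev)           # last group sorts to the end
    free = [idx for idx in range(n) if keys[idx] is None]
    assert len(free) == len(filler)
    for idx, value in zip(free, filler):
        keys[idx] = value
    return keys

def lemma2_holds(n, k, p, q):
    """Check Y^0_{q_i} = X^0_{p_i} for every setting of the free bits."""
    half = 2 ** k // 2
    for bits in product((0, 1), repeat=half):
        out = sorted(lemma2_instance(n, k, p, q, bits))
        if any((out[q[i]] & 1) != bits[i] for i in range(half)):
            return False
    return True
```

For example, `lemma2_holds(8, 2, [5, 1], [2, 6])` checks all four settings of the two free bits for $n = 8$, $k = 2$.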

We state now the $AT^2$ bound for short keys.

Theorem 7. Any VLSI (n,k)-sorter, with $k \le \log n$, satisfies the bound

$$AT^2 = \Omega(nr), \tag{25}$$

where $r = 2^k$.


Figure 1. Sending $r/2$ arbitrary bits in $X^0$ to $r/2$ arbitrary positions in $Y^0$.


Proof. Result (25) follows from Theorem 4, with $\mathcal{U} = \{X_i^0 : i = 0,1,\ldots,n-1\}$, $m = r/2$, and from the bound $I(r/2) = \Omega(r)$, which we now prove.

Given a protocol that assigns exactly $r/2$ entries of $X^0$ to $P_1$ and $n - r/2$ ($\ge r/2$) to $P_2$, let $P_s$ be the processor that outputs more entries of $Y^0$ (break a tie arbitrarily). From Lemma 2, we can always find two sets $\mathcal{U}_{in}$ and $\mathcal{U}_{out}$ as in (22) and (23) such that $\mathcal{U}_{in}$ is input by $P_{3-s}$ and $\mathcal{U}_{out}$ is output by $P_s$. Equation (24) implies that $r/2$ bits input by $P_{3-s}$ are output by $P_s$, for suitable values of input variables not in $\mathcal{U}_{in}$. Hence, $I(r/2) \ge r/2 = \Omega(r)$, as desired. □

We shall now prove an $AT$ lower bound based on information exchange under bounded storage (saturation).

Theorem 8. Any VLSI (n,k)-sorter, with $k \le \log n$, satisfies the bound

$$AT = \Omega(n\sqrt{r}), \tag{26}$$

where $r = 2^k$.

Proof. Referring to Theorem 5, we choose $\mathcal{U} = Y^0$, and, for some real $\sigma \in [0,\tfrac12]$, we consider a square tessellation with sidelength $\sqrt{\sigma r}$. Thus, there exists at least one cell $C$ (identified with processor $P_1$) that outputs $m \ge n\sigma r/A$ entries of $Y^0$. To estimate the quantity $I(m\,|\,\sigma r,\infty)$, we subdivide the interval $[0,T]$ into consecutive intervals and evaluate the information exchange in each such interval. Specifically, we form the sequence of intervals $([t_i+1,\, t_{i+1}] : 0 \le i < L,\; t_0 = -1,\; t_L = T)$, so that during $[t_i+1,\, t_{i+1}]$ cell $C$ outputs $m_i$ bits of $Y^0$, with $r/2 \le m_i < r/2 + \sigma r$ (this partition of $[0,T]$ is well-defined, since $C$ outputs at most $\sigma r$ bits per time unit). We now consider $[t_i+1,\, t_{i+1}]$ individually. We apply Lemma 2 by choosing the $r/2$ elements of $\mathcal{U}_{out}$ among the $m_i$ bits of $Y^0$ output by $C$ in the interval, and $\mathcal{U}_{in} \subset X^0$ arbitrarily. Since $X^0$ must be completely input before any bit of $Y^0$ can be output, during $[t_i+1,\, t_{i+1}]$ cell $C$ outputs $r/2$ bits already in the chip at $t_i+1$. But $C$ stores at most $\sigma r$ bits, whence $(\tfrac12-\sigma)r$ bits must flow across the cell boundary (of length $4\sqrt{\sigma r}$) during the interval. Summing this flow over all $L$ intervals we have

$$I(m\,|\,\sigma r,\infty) \ge L(\tfrac12-\sigma)r,$$

and, observing that $L \ge m/\max_i(m_i) \ge (n\sigma r/A)/\big(r(\tfrac12+\sigma)\big)$, we obtain

$$I(m\,|\,\sigma r,\infty) \ge \frac{(\tfrac12-\sigma)\,\sigma}{\tfrac12+\sigma}\cdot\frac{nr}{A}.$$

Since $I(m\,|\,\sigma r,\infty)$ flows across a section of length $4\sqrt{\sigma r}$ in time $T$ we have

$$AT \ge \frac{(\tfrac12-\sigma)\sqrt{\sigma}}{4(\tfrac12+\sigma)}\; n\sqrt{r}, \tag{27}$$

which completes our proof. (Inequality (27) yields the best bound for $\sigma \approx 0.117$.) □

From Theorems 7 and 8 we know that there exist constants $\beta_1$ and $\beta_2$ such that the performance of any (n,k)-sorter, with $k \le \log n$, satisfies the bounds $AT \ge \beta_1 n\sqrt{r}$ and $AT^2 \ge \beta_2 nr$. These bounds coincide for $T_0(r) = (\beta_2/\beta_1)\sqrt{r}$; therefore, since $T \ge \log n + \log k$, only the $AT$ bound is meaningful for values of $k$ satisfying $(\beta_2/\beta_1)2^{k/2} - \log k \le \log n$, while for values of $k$ satisfying $(\beta_2/\beta_1)2^{k/2} - \log k > \log n$ the $AT$ bound is stronger for $T > T_0$, and the $AT^2$ bound is stronger for $T < T_0$.
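
The crossover between the two short-key bounds can be made concrete with a small numerical sketch. The constants $\beta_1$ and $\beta_2$ are not specified by the theorems; the default value 1 used below is an arbitrary assumption, and the function name is hypothetical.

```python
import math

def short_key_area_bounds(n, k, T, beta1=1.0, beta2=1.0):
    """Lower bounds on the area A implied by the AT and AT^2 bounds for
    short keys, plus the crossover time T0. beta1, beta2 are placeholders."""
    r = 2 ** k
    area_from_at = beta1 * n * math.sqrt(r) / T      # from AT  >= beta1 * n * sqrt(r)
    area_from_at2 = beta2 * n * r / T ** 2           # from AT^2 >= beta2 * n * r
    t0 = (beta2 / beta1) * math.sqrt(r)
    return area_from_at, area_from_at2, t0
```

With $n = 1024$ and $k = 4$ (so $T_0 = 4$ under unit constants), the $AT^2$ bound gives the larger area requirement at $T = 2$ and the $AT$ bound gives the larger one at $T = 8$.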


3.2 Long Keys

In this section, we turn our attention to (n,k)-sorting for $k \ge 2\log n$. We begin by deriving two lower bounds on the information exchange $I(\mathcal{V}_1,\mathcal{V}_2)$ for an arbitrary assignment $(\mathcal{V}_1,\mathcal{V}_2)$ of variables to processors $P_1$ and $P_2$. The two bounds will be based on primary flow and secondary flow, respectively.

Let $(\mathcal{V}_1,\mathcal{V}_2)$ be an I/O assignment for (n,k)-sorting. For some real $\gamma \in [0,\tfrac12]$ we define a number of quantities, each a function of the parameter $\gamma$ and of $(\mathcal{V}_1,\mathcal{V}_2)$:

$b_j$ = number of input words whose $j$-th bit is input by $P_1$,

$c_j$ = number of output words whose $j$-th bit is output by $P_1$,

$J_0 = \{j : j < k - \log n,\; \gamma n \le c_j \le (1-\gamma)n\}$,

$J_1 = \{j : j < k - \log n,\; c_j > (1-\gamma)n\}$,

$J_2 = \{j : j < k - \log n,\; c_j < \gamma n\}$,

$$B = \sum_{j=0}^{k-\log n-1} b_j, \qquad B_s = \sum_{j \in J_s} b_j, \quad s = 0,1,2.$$

Note that $B$ is the number of bits input by $P_1$ in positions with index $< k - \log n$. In terms of the above quantities, we can state the following results.
In  terms  of  the  above  quantities,  we  can  state  the  following  results. 


Lemma 3 (Primary flow). With the above definitions, for any I/O assignment $(\mathcal{V}_1,\mathcal{V}_2)$ of (n,k)-sorting ($k \ge 2\log n$), and for any $\gamma \in [0,\tfrac12]$, the information exchange satisfies the bound

$$I(\mathcal{V}_1,\mathcal{V}_2) \ge \gamma(B - B_1) \ge \gamma(B - |J_1|\,n). \tag{28}$$

Proof. By a suitable selection of the $\log n$ most significant input positions, we can force the output sequence $Y_0,\ldots,Y_{n-1}$ to be any arbitrary cyclic shift of the input sequence $X_0,\ldots,X_{n-1}$. If we let $I_j$ be the information exchange pertaining to the position $j$ over all the $n$ different shifts, then

$$I_j = b_j(n - c_j) + (n - b_j)c_j.$$

In fact, each of the $b_j$ bits input by $P_1$ is output by $P_2$ $(n - c_j)$ times, and, symmetrically, each of the $(n - b_j)$ bits input by $P_2$ is output by $P_1$ $c_j$ times.

By the pigeon-hole principle there is a shift size with information exchange not smaller than the average, which ensures that

$$I(\mathcal{V}_1,\mathcal{V}_2) \ge \frac{1}{n}\sum_{j=0}^{k-\log n-1} I_j. \tag{29}$$

For $j \in J_0 \cup J_2$ we have $n - c_j \ge \gamma n$ and $I_j \ge b_j(n - c_j) \ge b_j\gamma n$. Thus, (29) yields

$$I(\mathcal{V}_1,\mathcal{V}_2) \ge \frac{1}{n}\sum_{j \in J_0 \cup J_2} b_j\gamma n = \gamma(B_0 + B_2) = \gamma(B - B_1).$$

The proof is completed by observing that $B_1 = \sum_{j \in J_1} b_j \le |J_1|\,n$, trivially. □
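
The counting step in the proof of Lemma 3 can be verified by brute force for a single bit position. In the sketch below (our own illustration, with hypothetical names), `in_p1[i]` and `out_p1[i]` say whether $P_1$ handles the $i$-th input and output bit of that position; a bit "crosses" when it is input by one processor and output by the other.

```python
def exchange_over_shifts(in_p1, out_p1):
    """Count bit crossings between P1 and P2, summed over all n cyclic shifts."""
    n = len(in_p1)
    total = 0
    for s in range(n):                 # every shift size
        for i in range(n):             # every input bit of this position
            dest = (i + s) % n         # output slot reached under shift s
            if in_p1[i] != out_p1[dest]:
                total += 1
    return total

def closed_form(in_p1, out_p1):
    """b(n - c) + (n - b)c, with b and c the counts of bits handled by P1."""
    n = len(in_p1)
    b, c = sum(in_p1), sum(out_p1)
    return b * (n - c) + (n - b) * c
```

Dividing either quantity by $n$ gives the average exchange per shift that the pigeon-hole step relies on.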

Lemma 4 (Secondary flow). For any I/O assignment $(\mathcal{V}_1,\mathcal{V}_2)$ of (n,k)-sorting ($k \ge 2\log n$), and for any $\gamma \in [0,\tfrac12]$, the information exchange satisfies the bound

$$I(\mathcal{V}_1,\mathcal{V}_2) \ge (1-\gamma)\,n\,\min(|J_1|, |J_2|, \log n) - \tfrac12 n\log n - n. \tag{30}$$

Proof. Assume, without loss of generality, that $P_1$ reads at most half of the bits in the $\log n$ most significant positions of $X$. We now construct a set $J$ of $\log n$ positions, so that the content of $\{X^l : k - \log n \le l < k\}$ can be recovered from $\{Y^j : j \in J\}$. Specifically, if $|J_1| \ge \log n$, $J$ is just any selection of $\log n$ positions of $J_1$; else, we augment $J_1$ with $(\log n - |J_1|)$ positions not in $J_1$ and of index less than $k - \log n$.

We then consider the following class of input instances.

(1) The leading $\log n$ bits of $X_i$ represent integer $\pi(i)$, where $\pi$ is a permutation of $0,\ldots,n-1$.

(2) The $\log n$ bits of $X_i$ which belong to positions in $J$ represent $i$.

(3) All remaining bits are zero.

Then, the output array $Y$ has the following structure.

(1) The leading $\log n$ bits of $Y_i$ represent integer $i$.

(2) The $\log n$ bits of $Y_i$ which belong to positions in $J$ represent integer $\pi^{-1}(i)$.

(3) All remaining bits are zero.

Thus, $\pi$ can be recovered from the output positions $\{Y^j : j \in J\}$. Since $P_1$ outputs at least $\min(|J_1|, \log n)\cdot(1-\gamma)n$ bits of these positions, and it reads at most $\tfrac12 n\log n$ bits among the $(n\log n - n + O(\log n))$ bits that specify $\pi$, we have that

$$I(\mathcal{V}_1,\mathcal{V}_2) \ge (1-\gamma)\,n\cdot\min(|J_1|, \log n) - \tfrac12 n\log n - n.$$

Considering the alternative case (i.e., $P_2$ reads at most half of the bits in the $\log n$ most significant positions) we obtain (30). □
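
The class of instances used in the proof of Lemma 4 is easy to simulate. The following sketch is an illustration under our own assumptions: $n$ is a power of two, keys are Python integers with bit $j$ of weight $2^j$, and the helper names are hypothetical. It builds the keys, sorts them, and reads the permutation back from the positions in $J$ alone.

```python
def build_keys(pi, J, k):
    """Key i carries pi[i] in its log n leading bits and i in the positions J."""
    n = len(pi)
    logn = n.bit_length() - 1                  # n is assumed to be a power of two
    assert len(J) == logn and all(0 <= j < k - logn for j in J)
    keys = []
    for i in range(n):
        key = pi[i] << (k - logn)              # leading log n bits encode pi(i)
        for b, j in enumerate(J):              # bit b of i goes to position J[b]
            key |= ((i >> b) & 1) << j
        keys.append(key)
    return keys

def recover_pi(sorted_keys, J):
    """Recover pi using only the bits in positions J of the sorted output."""
    pi = [None] * len(sorted_keys)
    for rank, key in enumerate(sorted_keys):
        i = sum(((key >> j) & 1) << b for b, j in enumerate(J))
        pi[i] = rank                           # the key of input i lands at rank pi(i)
    return pi
```

For example, with `pi = [2, 0, 3, 1]`, `J = [0, 3]` and `k = 6`, the call `recover_pi(sorted(build_keys(pi, J, k)), J)` returns the original permutation.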

3.2.1 $AT^2$ bound

Lemmas 3 and 4 can now be combined to prove:

Theorem 9. Any VLSI (n,k)-sorter, with $k \ge 2\log n$, satisfies the bound

$$AT^2 = \Omega(k\,n^2\log n). \tag{31}$$

Proof. With the notation of Theorem 4, we choose $\mathcal{U} = \{X_i^j : 0 \le i < n,\; 0 \le j < k - \log n\}$ and select $P_1$ so that $m = n\log n$. (Note that $B$, the number of bits of $\mathcal{U}$ input by $P_1$, is equal to $m$ since $\mathcal{U}$ is a set of input variables.) With this choice, $|\mathcal{U}| = (k - \log n)n$; assuming, for simplicity, $k \ge 3\log n$, the result follows from Theorem 4, if we can show $I(n\log n) = \Omega(n\log n)$.

Indeed, from Lemma 3 and $B = n\log n$, for any $\gamma \in [0,\tfrac12]$ we have:

$$I(n\log n) \ge \gamma(n\log n - n|J_1|). \tag{32}$$

If we reverse the roles of $P_1$ and $P_2$ in Lemma 3, and note that $P_2$ reads $(k - 2\log n)n \ge n\log n$ bits of $\mathcal{U}$, we also obtain

$$I(n\log n) \ge \gamma(n\log n - n|J_2|).$$

If $\min(|J_1|, |J_2|) \ge \log n$, then inequality (30) yields

$$I(n\log n) \ge (1-\gamma)n\log n - \tfrac12 n\log n - n = \Omega(n\log n)$$

by selecting $\gamma = 0$. Otherwise, (30) becomes

$$I(n\log n) \ge (1-\gamma)\,n\cdot\min(|J_1|, |J_2|) - \tfrac12 n\log n - n. \tag{33}$$

Assuming, without loss of generality, that $|J_1| \le |J_2|$, we multiply both sides of (32) by $(1-\gamma)$ and both sides of (33) by $\gamma$, and add corresponding sides; we obtain

$$I(n\log n) \ge \gamma(\tfrac12 - \gamma)\,n\log n - \gamma n = \Omega(n\log n),$$

by selecting $\gamma = \tfrac14$. □
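
The last algebraic step, in which the dependence on $|J_1|$ cancels, can be checked with exact arithmetic. The sketch below is only a sanity check of that combination, assuming $|J_1| \le |J_2|$ and $|J_1| < \log n$; the function names are hypothetical.

```python
from fractions import Fraction

def combined_bound(n, log_n, j1, gamma):
    """(1 - gamma) * rhs(32) + gamma * rhs(33), with |J1| = j1."""
    rhs32 = gamma * (n * log_n - n * j1)
    rhs33 = (1 - gamma) * n * j1 - Fraction(1, 2) * n * log_n - n
    return (1 - gamma) * rhs32 + gamma * rhs33

def closed_form(n, log_n, gamma):
    """gamma * (1/2 - gamma) * n * log n - gamma * n."""
    return gamma * (Fraction(1, 2) - gamma) * n * log_n - gamma * n
```

For $n = 1024$ and $\gamma = 1/4$ both expressions evaluate to 384 for every admissible value of $|J_1|$.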

3.2.2 $AT$ bound

We now turn our attention to saturation flow. Let integer $t(j) \in [0,T]$ be the instant at which the last bit of $X^j$ is input; we then define the following sets:

$$J_1(t) = \{j : j \in J_1 \text{ and } t(j) = t\}.$$

The sets $J_1(t)$, $t = 0,\ldots,T-1$, form a partition of $J_1$ (note that, for some $t$, $J_1(t)$ may be empty). Note that no bit belonging to a position in $J_1(t)$ can be output prior to $t+1$. These sets are instrumental in establishing the following result:

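Lemma 5 (Saturation flow). Let $s$ be the number of storage bits of $P_1$. Given an I/O protocol inducing an I/O assignment $(\mathcal{V}_1,\mathcal{V}_2)$, and an interval $[\tau_1,\tau_2]$, let $Z \subseteq J_1(\tau_1) \cup J_1(\tau_1+1) \cup \cdots \cup J_1(\tau_2)$ be a set of $|Z| \le \log n$ bit positions. Then, for any $\gamma \in [0,\tfrac12]$, we have

$$I(\mathcal{V}_1,\mathcal{V}_2\,|\,s,\infty) \ge \frac{1-\gamma}{2}\,n|Z| - s. \tag{34}$$

Moreover, if $[\tau_1^i,\tau_2^i]$ are pairwise disjoint time intervals for $i = 1,2,\ldots,L$, and $Z_i \subseteq J_1(\tau_1^i) \cup \cdots \cup J_1(\tau_2^i)$, with $|Z_i| \le \log n$, then

$$I(\mathcal{V}_1,\mathcal{V}_2\,|\,s,\infty) \ge \frac{1-\gamma}{2}\,n\sum_{i=1}^{L}|Z_i| - sL. \tag{35}$$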

Proof. Let $z$ be the number of bits belonging to positions in $Z$ that are output by $P_1$ in $[\tau_1,\tau_2]$. We first consider the primary flow, under saturated conditions. All $|Z|n$ bits belonging to positions in $Z$ have been input by time $\tau_2$ but none is output prior to $\tau_1$. Since $(1-\gamma)|Z|n$ of them are output by $P_1$ (note that $Z \subseteq J_1$), then $(1-\gamma)|Z|n - z$ are output by $P_1$ after $\tau_2$. Since $P_1$ stores at most $s$ of them, $(1-\gamma)|Z|n - z - s$ are stored in $P_2$ at $\tau_2$ and must be transferred from $P_2$ to $P_1$ after $\tau_2$. It follows that

$$I(\mathcal{V}_1,\mathcal{V}_2\,|\,s,\infty) \ge (1-\gamma)|Z|n - z - s. \tag{36}$$

This primary flow bound is strong for small $z$; for large $z$, we have a correspondingly large secondary flow. Indeed, since $|Z| \le \log n$, we augment $Z$ to $Z^*$ by adjoining $(\log n - |Z|)$ arbitrary positions of index $< k - \log n$. Arguing as in Lemma 4, we can show that $n\log n$ bits relative to $\{X^l : l = k - \log n,\ldots,k-1\}$ can be extracted from $\{Y^j : j \in Z^*\}$. But $P_1$ outputs $z$ of these bits during $[\tau_1,\tau_2]$, and, since at most $s$ are in $P_1$ at $\tau_1$, $z - s$ must be transferred from $P_2$ to $P_1$ in $[\tau_1,\tau_2]$. It follows that

$$I(\mathcal{V}_1,\mathcal{V}_2\,|\,s,\infty) \ge z - s. \tag{37}$$

Adding corresponding sides of (36) and (37) and dividing by 2 we obtain (34).

To prove the general result (35) we now claim that, if $Z_1$ and $Z_2$ correspond to nonoverlapping intervals, then their contributions to $I(\mathcal{V}_1,\mathcal{V}_2\,|\,s,\infty)$ are independent and can be added.

Indeed, for the primary flows generated by $Z_1$ and $Z_2$, the claim follows from the fact that, although possibly overlapping in time, these two flows carry independent information on the input values ($Z_1 \cap Z_2 = \emptyset$).

For the secondary flows generated by $Z_1$ and $Z_2$ the claim follows from the fact that they occur in different periods of time ($[\tau_1^1,\tau_2^1] \cap [\tau_1^2,\tau_2^2] = \emptyset$). Intuitively, $P_1$ needs information on the leading input positions $X^{k-1},\ldots,X^{k-\log n}$ at several different times and, since it cannot afford to store that information permanently, every time it must receive it from $P_2$. □

This  result  is  now  used  to  establish  the  following  theorem: 

Theorem 10. Any VLSI (n,k)-sorter, with $k \ge 2\log n$, satisfies the bound

$$AT = \Omega(kn\sqrt{n\log n}). \tag{38}$$

Proof. With the notation of Theorem 5 we again choose $\mathcal{U} = \{X_i^j : 0 \le i < n,\; 0 \le j < k - \log n\}$ and select $P_1$ with area (storage) $s = \sigma n\log n$ for a suitable constant $\sigma$. With this choice $|\mathcal{U}| = (k - \log n)n$; if we can show that $I(m\,|\,\sigma n\log n,\infty) = \Omega(m)$, then bound (38) follows from (16), since $\ell = 4\sqrt{s} = 4\sqrt{\sigma n\log n}$.

[In the following arguments we introduce parameters $\gamma$, $\sigma$, $\varepsilon$, and $\xi$ whose values could be chosen for a fine tuning of the lower bound. A feasible choice that gives the right feeling for the range of these parameters is $\gamma = 1/12$, $\sigma = 5/24$, $\varepsilon = 1/4$, $\xi = 29/36$.]

For given real $\varepsilon \in [0,1]$, we partition the sequence of times $0,1,\ldots,T-1,T$ into two subsequences: $t'_1 < t'_2 < \cdots < t'_u$, the times with $|J_1(t'_h)| \ge \varepsilon\log n$, and $t''_1 < t''_2 < \cdots < t''_v$, the times with $|J_1(t''_h)| < \varepsilon\log n$. Obviously

$$\sum_{h=1}^{u}|J_1(t'_h)| + \sum_{h=1}^{v}|J_1(t''_h)| = |J_1|. \tag{39}$$

We now consider separately the contributions to the information exchange of positions belonging to $J_1(t'_h)$ ($t'$-sequence) and $J_1(t''_h)$ ($t''$-sequence).


$t'$-Sequence. At time $t'_h$, all the $|J_1(t'_h)|n$ bits pertaining to positions in $J_1(t'_h)$ are in the chip, and, by definition of $J_1$, at least $(1-\gamma)|J_1(t'_h)|n$ of them have to be output by $P_1$. Due to the bound on the storage of $P_1$, at least $(1-\gamma)|J_1(t'_h)|n - \sigma n\log n$ of these bits are in $P_2$ at time $t'_h$ and will be eventually transferred to $P_1$. Thus, the overall contribution $I'$ to the information exchange associated with the $t'$-sequence is bounded below as

$$I' \ge (1-\gamma)\,n\sum_{h=1}^{u}|J_1(t'_h)| - u\,\sigma n\log n. \tag{40}$$

Since $|J_1(t'_h)| \ge \varepsilon\log n$, we have $u\,\varepsilon\log n \le \sum_{h=1}^{u}|J_1(t'_h)|$, and (40) can be rewritten as

$$I' \ge (1-\gamma-\sigma/\varepsilon)\,n\sum_{h=1}^{u}|J_1(t'_h)|. \tag{41}$$


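$t''$-Sequence. We decompose $[0,T]$ into the sequence of intervals $([t''_{h_i}+1,\, t''_{h_{i+1}}] : i = 0,\ldots,L,\; t''_{h_0} = -1 \text{ and } t''_{h_{L+1}} = T)$ as follows: for some real $\xi$ ($\varepsilon < \xi < 1$) and for $i = 0,\ldots,L-1$,

$$(\xi-\varepsilon)\log n < \sum_{h=h_i+1}^{h_{i+1}}|J_1(t''_h)| \le \xi\log n.$$

Such a decomposition always exists, since $|J_1(t''_h)| < \varepsilon\log n$. Moreover,

$$L \le \sum_{h=1}^{v}|J_1(t''_h)| \,/\, \big((\xi-\varepsilon)\log n\big), \tag{42}$$

since at least $(\xi-\varepsilon)\log n$ positions complete their input in any given interval.

We can now apply Lemma 5 with $[\tau_1^i,\tau_2^i] = [t''_{h_i}+1,\, t''_{h_{i+1}}]$, $Z_i = J_1(t''_{h_i+1}) \cup \cdots \cup J_1(t''_{h_{i+1}})$, and $s = \sigma n\log n$ to conclude that the overall contribution $I''$ to the information exchange associated with the $t''$-sequence is bounded below as

$$I'' \ge \frac{1-\gamma}{2}\,n\sum_{h=1}^{v}|J_1(t''_h)| - L\,\sigma n\log n,$$

which, by using (42), can be restated as

$$I'' \ge \Big(\frac{1-\gamma}{2} - \frac{\sigma}{\xi-\varepsilon}\Big)\,n\sum_{h=1}^{v}|J_1(t''_h)|. \tag{43}$$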


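By choosing $\xi = \varepsilon + \big(1/\varepsilon - (1-\gamma)/(2\sigma)\big)^{-1}$ the coefficients of the two bounds (41) and (43) become identical and, by (39), we obtain

$$I(m\,|\,\sigma n\log n,\infty) \ge I' + I'' \ge (1-\gamma-\sigma/\varepsilon)\,n|J_1|. \tag{44}$$

We now observe that Lemma 3 (with $B = m$) can be invoked, yielding:

$$I(m\,|\,\sigma n\log n,\infty) \ge I(m) \ge \gamma(m - |J_1|\,n). \tag{45}$$

If we choose $\sigma/\varepsilon = 1 - 2\gamma$, then (44) and (45) yield

$$I(m\,|\,\sigma n\log n,\infty) \ge \frac{\gamma}{2}\,m$$

and the theorem is proved. □
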
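
The feasible parameter choice quoted in the proof can be checked with exact rational arithmetic. The sketch below assumes that the coefficient of the $t'$-sequence bound is $1-\gamma-\sigma/\varepsilon$ and that the coefficient of the $t''$-sequence bound is $(1-\gamma)/2 - \sigma/(\xi-\varepsilon)$, which is our reading of the derivation from Lemma 5 and (42); the function name is hypothetical.

```python
from fractions import Fraction as F

def theorem10_coefficients(gamma, sigma, eps):
    """Return (xi, coefficient of the t'-bound, coefficient of the t''-bound)."""
    xi = eps + 1 / (1 / eps - (1 - gamma) / (2 * sigma))
    coeff_t1 = 1 - gamma - sigma / eps
    coeff_t2 = (1 - gamma) / 2 - sigma / (xi - eps)
    return xi, coeff_t1, coeff_t2

xi, c1, c2 = theorem10_coefficients(F(1, 12), F(5, 24), F(1, 4))
```

With $\gamma = 1/12$, $\sigma = 5/24$, $\varepsilon = 1/4$ this gives $\xi = 29/36$, both coefficients equal to $1/12 = \gamma$, and $\sigma/\varepsilon = 5/6 = 1 - 2\gamma$.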
3.2.3 $AT/\log A$ bound

We shall now consider the constraints posed by computational friction on sorting long keys and obtain a new lower bound which turns out to be stronger than both the $AT^2$ and the $AT$ bounds for $k$ large enough and for a suitable range of computation times.

Theorem 11. Any VLSI (n,k)-sorter, with $k \ge 2\log n$, satisfies the bound

$$AT \ge \beta_0\, nk\log(nk/T) \tag{46}$$

for some constant $\beta_0 > 0$. Equivalently,

$$AT/\log A = \Omega(nk). \tag{47}$$

Proof. We want to apply Theorem 6 to the set $\mathcal{U}$ of input variables $\{X_i^j : 0 \le i < n,\; 0 \le j < k - \log n\}$. To the set $\mathcal{U}_t$ of elements of $\mathcal{U}$ input exactly at time $t$, we associate the set $\mathcal{Y}_t$ of the $\lfloor|\mathcal{U}_t|/2\rfloor$ least significant variables in $\{Y_i^j : X_i^j \in \mathcal{U}_t\}$. Ties (variables pertaining to the same bit position) are broken arbitrarily. All the variables in $\mathcal{Y}_t$ are functionally dependent upon the $\lceil|\mathcal{U}_t|/2\rceil$ most significant variables of $\mathcal{U}_t$, so that, in the terminology of Theorem 6, $\alpha(s) = \lceil s/2\rceil$. If we set the bits $X_i^{k-\log n}\ldots X_i^{k-1}$ to be the binary representation of $i$, then $Y_i^j = X_i^j$ for all $i$'s and $j$'s, and $\mathcal{Y}_t$ trivially carries $\lfloor|\mathcal{U}_t|/2\rfloor$ bits of information about $\mathcal{U}_t$, so that $\beta(s) = \lfloor s/2\rfloor$. Thus, all the assumptions of Theorem 6 and Corollary 2 are satisfied, and bounds (46) and (47) follow from (20) and (21) with $|\mathcal{U}| = \Theta(nk)$. □
3.2.4 Remarks

From Theorem 9 and Theorem 10 we know that there exist constants $\beta_1$ and $\beta_2$ such that the performance of any (n,k)-sorter, with $k \ge 2\log n$, satisfies the bounds $AT \ge \beta_1 kn\sqrt{n\log n}$ and $AT^2 \ge \beta_2 kn(n\log n)$. These bounds coincide at time $T_0 = (\beta_2/\beta_1)\sqrt{n\log n}$. The $AT$ bound is stronger for $T > T_0$, and the $AT^2$ bound is stronger for $T < T_0$.

If we compare the friction bound (46) with the saturation bound (38), written as $AT \ge \beta_1 kn\sqrt{n\log n}$, we see that the former is stronger when $k/T > g(n)/n$, with $g(n) = 2^{(\beta_1/\beta_0)\sqrt{n\log n}}$. If we now consider the crude fan-in argument that prescribes $T \ge \tau\log(kn)$ (for a suitable constant $\tau$), we see that $k/T > g(n)/n$ is satisfied by feasible computation times only if $k > k_0$, where $k_0 = \Theta(g(n)\sqrt{n\log n}/n)$ is the solution of $k/(\tau\log(kn)) = g(n)/n$, and $T_0 = \Theta(\sqrt{n\log n})$ (see Figure 2).
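
To get a feel for how the three long-key bounds compare, they can be evaluated side by side as lower bounds on the area $A$ for a given computation time $T$. The constants $\beta_0$, $\beta_1$, $\beta_2$ are not specified by the theorems; setting them all to 1 below is an arbitrary assumption made only for illustration, and the function name is hypothetical.

```python
import math

def long_key_area_bounds(n, k, T, beta0=1.0, beta1=1.0, beta2=1.0):
    """Area lower bounds implied by the AT^2, AT and friction bounds
    for long keys (k >= 2 log n), with placeholder constants."""
    n_log_n = n * math.log2(n)
    return {
        "AT^2": beta2 * k * n * n_log_n / T ** 2,
        "AT": beta1 * k * n * math.sqrt(n_log_n) / T,
        "friction": beta0 * n * k * math.log2(n * k / T) / T,
    }

at_crossover = long_key_area_bounds(16, 8, 8)   # T0 = sqrt(16 * 4) = 8
```

With unit constants the $AT$ and $AT^2$ entries agree at $T = T_0 = \sqrt{n\log n}$, and the $AT^2$ entry is the larger one for smaller $T$.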


3.3 Summary

We conclude this section with a summary of known lower bounds on sorting, given in Table I. Bounds on $T$ follow from trivial fan-in arguments. Bounds on $A$ are due to Leighton [11] for long keys, and to Siegel [12] for short and intermediate-length keys. Bisection bounds on $AT^2$ for long keys are due to Thompson [1] (for word-level protocols) and Leighton [11] (for arbitrary protocols). For short and medium-length keys bisection bounds are due to Siegel [12,13]. An alternative proof of $AT^2 = \Omega(n^2h^2)$, based on primary and secondary flow, is given in [14]. The remaining bounds are those given in this paper. The $AT^2 = \Omega(nr)$ result has been independently obtained in [12].

The above bounds show that the area-time complexity of sorting is determined by different computational mechanisms, each of which dominates for a particular range of key lengths and computation times. An effective overview of the different bounds is provided by Figure 2.

All the lower bounds of Table I are known to be tight or nearly tight. Several optimal circuits for $(n, \log n + \Theta(\log n))$-sorting (the first case of sorting analyzed in the VLSI model, which partially overlaps with sorting of intermediate-length and long keys) have been proposed by the authors [15,16,17] and by Leighton [11]. Optimal circuits for keys of any intermediate length are given in [14]. Constructions that attain the $AT^2$, $AT$ and $AT/\log A$ bounds for long keys, as well as constructions that are near optimal for short keys, are described in [18]. Further optimal designs for several key lengths are reported by Cole and Siegel [19] (personal communication). Several designs of VLSI sorters, potentially of practical interest even when asymptotically suboptimal, are surveyed by Thompson [20]. A systematic discussion of VLSI sorting can be found in [14].


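Table I. Summary of lower bounds for (n,k)-sorting

Length of the keys: short, $1 \le k \le \log n$ ($r = 2^k$); intermediate, $\log n < k < 2\log n$ ($h = k - \log n$); long, $2\log n \le k$.

Lower bound technique                                  Short keys                          Intermediate keys          Long keys

Bipartition + information exchange                     $AT^2 = \Omega(r^2\log^2(1+n/r))$   $AT^2 = \Omega(n^2h^2)$    $AT^2 = \Omega(n^2\log^2 n)$

Square tessellation + information exchange             $AT^2 = \Omega(nr)$                 -                          $AT^2 = \Omega(nk(n\log n))$

Square tessellation + saturated information exchange   $AT = \Omega(n\sqrt{r})$            -                          $AT = \Omega(nk\sqrt{n\log n})$

Friction                                               -                                   -                          $AT/\log A = \Omega(nk)$

Bounds on $A$                                          $A = \Omega(r\log(1+n/r))$          $A = \Omega(nh)$           $A = \Omega(n\log n)$

Bounded fan-in                                         $T = \Omega(\log n)$                $T = \Omega(\log n)$       $T = \Omega(\log n + \log k)$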


Figure 2. Regions of the $(k,T)$ plane and their dominant regimes, for short, intermediate-length, and long keys.


4. Other Applications

The lower-bound techniques of Section 2 afford the analysis of several other problems besides those discussed in Section 3. A few results are stated and commented on below. The proofs are similar in flavor to those given in Section 3, and are omitted here for the sake of brevity.

CYCLIC SHIFT. The input of the (n,k)-cyclic-shift problem is a pair $(X,s)$, where $X$ is an $n \times k$ binary array, $s \in \{0,1,\ldots,n-1\}$, and the output is a binary array $Y$ such that $Y_j = X_{(j-s) \bmod n}$.

Theorem 12. Any VLSI (n,k)-cyclic-shifter satisfies the bounds

$$AT^2 = \Omega(kn^2), \tag{48a}$$

$$AT = \Omega(kn^{3/2}). \tag{48b}$$

Comments. A detailed proof of Theorem 12 is given in [14]. Result (48a) has also been reported in [4].
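
As a plain functional reference for the problem statement (not a model of any VLSI design), the cyclic shift can be written as follows; the rows of `X` stand for the $n$ words, and the function name is hypothetical.

```python
def cyclic_shift(X, s):
    """Return Y with Y[j] = X[(j - s) mod n] for a list X of n words."""
    n = len(X)
    return [X[(j - s) % n] for j in range(n)]

# Four 3-bit words shifted by one position.
words = [0b001, 0b010, 0b011, 0b100]
shifted = cyclic_shift(words, 1)
```

Here `shifted` is `[0b100, 0b001, 0b010, 0b011]`: each word moves one place forward and the last word wraps around to position 0.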

MERGING. The (n,k)-merging problem is the specialization of (n,k)-sorting when the input subsequences $(X_0,\ldots,X_{n/2-1})$ and $(X_{n/2},\ldots,X_{n-1})$ are sorted.

Theorem 13. Any VLSI (n,k)-merger satisfies the following bounds. For $k \le \log n$, and $r = 2^k$:

$$AT^2 = \Omega(nr), \tag{49a}$$

$$AT = \Omega(nr^{1/2}). \tag{49b}$$

For $k > \log n$:

$$AT^2 = \Omega((k-\log n)n^2), \tag{49c}$$

$$AT = \Omega((k-\log n)n^{3/2}), \tag{49d}$$

$$AT/\log A = \Omega((k-\log n)n). \tag{49e}$$

Comments. Bounds (49a), (49b) and (49e) are identical in order to those obtained for (n,k)-sorting. Bounds (49c) and (49d) are smaller, by factors of $\log n$ and $\sqrt{\log n}$, respectively. The reason is that while primary flow is of the same order in both problems, the secondary flow is a factor of $\log n$ smaller for merging. Indeed, $\Theta(n)$ bits are necessary and sufficient to specify a merging permutation.
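
The gap in secondary flow can be illustrated by counting permutations. Assuming distinct keys, merging two sorted halves of length $n/2$ can produce $\binom{n}{n/2}$ different interleavings, while sorting can produce any of $n!$ permutations. The sketch below (hypothetical helper names, Python 3.8 or later for `math.comb`) compares the number of bits needed to name one of them.

```python
import math

def bits_for_merge(n):
    """log2 of the number of interleavings of two sorted halves of size n/2."""
    return math.log2(math.comb(n, n // 2))

def bits_for_sort(n):
    """log2 of the number of permutations of n distinct keys."""
    return math.log2(math.factorial(n))
```

For $n = 1024$ the first quantity is slightly below $n$ bits, while the second is of the order of $n\log n$ bits.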

RECORD SORTING. A formulation of sorting, more general than the one considered in Section 3, assumes that the $n$ items to be sorted are records of two fields, the key (of $k$ bits) and the information (of $p$ bits). The output is the multiset of input records rearranged in order of nondecreasing keys. Sorting is called stable when records with the same key preserve in the output sequence the same relative order they had in the input sequence.
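
A small functional example of stable record sorting, given only to fix the definition: records are modeled as `(key, information)` pairs, and the sketch relies on the documented stability of Python's built-in `sorted`. The function name is hypothetical.

```python
def stable_record_sort(records):
    """Sort (key, information) records by key; ties keep their input order."""
    return sorted(records, key=lambda record: record[0])

records = [(1, "a"), (0, "b"), (1, "c"), (0, "d")]
ordered = stable_record_sort(records)
```

Here `ordered` is `[(0, "b"), (0, "d"), (1, "a"), (1, "c")]`: within each key, the information fields appear in their original relative order.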

Theorem 14. A VLSI stable sorter of records, with $k \le p$ and $k \le \log n$, satisfies the bounds

$$AT^2 = \Omega(pn^2k), \tag{50a}$$

$$AT = \Omega(pn(nk)^{1/2}). \tag{50b}$$

Comments. Proofs are similar to those of Theorems 9 and 10. However, observe that there is no analog of Theorem 11, since the bit positions of the information field do not interact with each other.

The  lower  bound  techniques  of  this  paper  are  certainly  applicable  to 
other  problems,  and  we  hope  they  will  contribute  to  a  coherent  formulation  of 

